Abstract

Dataset storage, exchange, and access play a critical 
role in scientific applications. For such purposes netCDF 
serves as a portable and efficient file format and
programming interface, which is popular in numerous
scientific application domains. However, the original
interface does not provide an efficient mechanism for
parallel data storage and access.

In this work, we present a new parallel interface for writ- 
ing and reading netCDF datasets. This interface is derived 
with minimum changes from the serial netCDF interface 
but defines semantics for parallel access and is tailored for 
high performance. The underlying parallel I/O is achieved 
through MPI-IO, allowing for dramatic performance gains 
through the use of collective I/O optimizations. We com- 
pare the implementation strategies with HDF5 and ana- 
lyze both. Our tests indicate programming convenience and 
significant I/O performance improvement with this parallel 
netCDF interface. 



1 Introduction 

Scientists have recognized the importance of portable 
and efficient mechanisms for storing large datasets created 
and used by their applications. The Network Common Data 
Form (netCDF) J5] is one such mechanism used by a 
number of applications. 

NetCDF intends to provide a common data access 
method for atmospheric science applications to deal with 
a variety of data types that encompass single-point
observations, time series, regularly spaced grids, and
satellite or radar images [8]. Today several organizations
have adopted netCDF as a data access standard [19].

The netCDF design consists of both a portable file format
and an easy-to-use application programming interface (API)
for storing and retrieving netCDF files across multiple
platforms. More and more scientific applications choose
netCDF as their output file format. As these applications
become computationally and data intensive, they tend to be
parallelized on high-performance computers. Hence, it is
highly desirable to have an efficient parallel programming
interface to netCDF files. Unfortunately, the original
design of the netCDF interface is proving inadequate for
parallel applications because it lacks a parallel access
mechanism. In particular, there is no support for
concurrently writing to a netCDF file. Hence, parallel
applications operating on netCDF files must serialize
access. Traditionally, parallel applications write to
netCDF files through one of the allocated processes, which
easily becomes a performance bottleneck. Serial I/O access
is both slow and cumbersome for the application programmer.

To facilitate parallel I/O operations, we have defined a 
parallel API for concurrently accessing netCDF files. With 
minimum changes to the names and argument lists, this in- 
terface maintains the look and feel of the serial netCDF in- 
terface while the implementation underneath incorporates 
well-known parallel I/O techniques such as collective I/O 
to allow high-performance data access. We implement this 
work on top of MPI-IO, which is specified by the MPI-2
standard [3] and is freely available on most platforms.
Since MPI has become the de facto parallel mechanism for
communication and I/O on most parallel environments, this 
approach is portable across different platforms. 

Hierarchical Data Format version 5 (HDF5) [5] also provides
a portable file format and programming interfaces for
storing multidimensional arrays together with ancillary data
in a single file. It supports parallel I/O, and its
implementation is also built on top of MPI-IO. However, the
HDF5 API is too flexible and cumbersome to become an
easy-to-use standard, since it adds more programming
features and completely re-designs the API from its previous
version. Our parallel netCDF interface, on the other hand,
is more concise, stays closer to the original API, and maps
more directly onto the MPI-IO interface, which introduces
less overhead while providing more optimization
opportunities for performance enhancement. The goal of this
work is to make the parallel netCDF interface a data access
standard for parallel scientific applications.

We run a couple of benchmarks using parallel netCDF 
and parallel HDF5, exploring both artificially made access 
patterns from our own benchmark and the ones from a real 
application called FLASH (Q. In our experiments, paral- 
lel netCDF brings significant I/O improvement and shows 
better performance than parallel HDF5 in the FLASH I/O 
benchmark [18].

The rest of this paper is organized as follows. Section 2
reviews some related work. Section 3 presents the design
background of netCDF and points out its potential usage in
parallel scientific applications. The design and
implementation of our parallel netCDF are described in
Section 4. Experimental performance results are given in
Section 5. Section 6 concludes the paper.

2 Related Work 

Considerable research has been done on data access 
for scientific applications. The work has focused on data 
I/O performance and data management convenience. Two 
projects, MPI-IO and HDF, are most closely related to our 
research. 

MPI-IO is a parallel I/O interface specified in the MPI-2
standard. It is implemented and used on a wide range of
platforms. The most popular implementation, ROMIO [17],
is implemented portably on top of an abstract I/O device
layer [14, 16] that enables portability to new underlying
I/O systems. One of the most important features in ROMIO
is collective I/O operations, which adopt a two-phase I/O
strategy [7, 12, 13, 15] and improve parallel I/O
performance by significantly reducing the number of I/O
requests, which would otherwise be many small,
noncontiguous requests. However, MPI-IO reads and writes
data in a raw format without providing any functionality to
effectively manage the associated metadata. Nor does it
guarantee data portability, thereby making it inconvenient
for scientists to organize, transfer, and share their
application data.

HDF is a file format and software, developed at NCSA, 
for storing, retrieving, analyzing, visualizing, and convert- 
ing scientific data. The most popular versions of HDF 
are HDF4 [4] and HDF5 [5]. The design goal of HDF4
is mainly to deal with sequential data access and its APIs 
are consistent with its earlier versions. On the other hand, 
HDF5 is a major revision in which its APIs are completely 
re-designed. Both versions store multidimensional arrays
together with ancillary data in portable, self-describing
file formats. The support for parallel data access in HDF5
is built on top of MPI-IO, which ensures its portability
since MPI-IO has become a de facto standard for parallel
I/O. However, because the HDF5 file format is not
compatible with HDF4, migrating existing HDF4 applications
to HDF5 can be inconvenient for programmers. Furthermore,
HDF5 adds several new features, such as a hierarchical file
structure, to describe more metadata, but these features
also make parallel data access harder to implement
underneath. The overhead involved may make HDF5 perform
much worse than its underlying MPI-IO. This problem is
examined with a number of scientific applications in
[18, 10].

3 NetCDF Background 

NetCDF is an abstraction that supports a view of data 
as a collection of self-describing, portable, array-oriented 
objects that can be accessed through a simple interface. It 
defines a file format as well as a set of programming inter- 
faces for storing and retrieving data in the form of arrays 
in netCDF files. We first describe the netCDF file format 
and its serial API and then consider various approaches to 
access netCDF files in parallel computing environments. 

3.1 File Format 

NetCDF stores data in an array-oriented dataset, which 
contains dimensions, variables, and attributes. Physically, 
the dataset file is divided into two parts: file header and ar- 
ray data. The header contains all information (or metadata) 
about dimensions, attributes, and variables except for the 
variable data itself, while the data part contains arrays of 
variable values (or raw data). 

The netCDF file header first defines a number of dimen- 
sions, each with a name and a length. These dimensions are 
used to define the shapes of variables in the dataset. One di- 
mension can be unlimited and is used as the most significant 
dimension (record dimension) for growing-size variables. 

Following the dimensions, a list of named attributes is
used to describe the properties of the dataset (e.g., data
range, purpose, associated applications). These are called
global attributes and are separate from attributes associated
with individual variables.

The basic units of named data in a netCDF dataset are 
variables, which are multidimensional arrays. The header 
part describes each variable by its name, shape, named at- 
tributes, data type, array size, and data offset, while the data 
part stores the array values for one variable after another, in 
their defined order. 

Figure 1. NetCDF file structure: there is a file
header containing metadata of the stored arrays,
then the fixed-size arrays are laid out in
the following contiguous file space in a linear
order, with variable-size arrays appended at
the end of the file in an interleaved pattern.

To support variable-size arrays (e.g., data growing with
time stamps), netCDF introduces record variables and uses
a special technique to store such data. All record variables
share the same unlimited dimension as their most
significant dimension and are expected to grow together
along that dimension. The remaining, less significant
dimensions together define the shape of one record of the
variable. For fixed-size arrays, each array is stored in a
contiguous file space starting from a given offset. For
variable-size arrays, netCDF first defines a record of an
array as a subarray comprising all fixed dimensions; the
records of all such arrays are then stored interleaved, in
the arrays' defined order. Figure 1 illustrates the storage
layouts for fixed-size and variable-size arrays in a
netCDF file.
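
The layout described above can be made concrete with a little offset arithmetic. The following is a minimal sketch, not code from the netCDF library: the type `nc_layout` and the functions `fixed_var_offset` and `record_var_offset` are hypothetical names introduced for illustration, and the sketch ignores any alignment padding that the real file format may apply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a netCDF-style linear layout (not library API). */
typedef struct {
    size_t header_size;        /* bytes occupied by the file header          */
    size_t nfixed;             /* number of fixed-size variables             */
    const size_t *fixed_size;  /* byte size of each fixed-size variable      */
    size_t nrec;               /* number of record variables                 */
    const size_t *rec_size;    /* byte size of ONE record of each record var */
} nc_layout;

/* Offset of fixed-size variable v: the header, then all earlier fixed
 * variables in their defined order. */
size_t fixed_var_offset(const nc_layout *l, size_t v)
{
    assert(v < l->nfixed);
    size_t off = l->header_size;
    for (size_t i = 0; i < v; i++)
        off += l->fixed_size[i];
    return off;
}

/* Offset of record r of record variable v. Records are interleaved: one
 * "row" holds record r of every record variable, in definition order. */
size_t record_var_offset(const nc_layout *l, size_t v, size_t r)
{
    assert(v < l->nrec);
    size_t begin = l->header_size;        /* start of the record section */
    for (size_t i = 0; i < l->nfixed; i++)
        begin += l->fixed_size[i];

    size_t row = 0, within = 0;
    for (size_t i = 0; i < l->nrec; i++) {
        if (i < v)
            within += l->rec_size[i];     /* earlier variables in the row */
        row += l->rec_size[i];            /* total bytes per record row   */
    }
    return begin + r * row + within;
}
```

With a 64-byte header, fixed-size variables of 100 and 40 bytes, and record variables whose records occupy 8 and 16 bytes, the record section of this sketch starts at byte 204 and each interleaved row is 24 bytes long.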

In order to achieve network transparency (machine
independence), both the header and data parts of the file
are represented in a well-defined format similar to XDR
(eXternal Data Representation) but extended to support
efficient storage of arrays of non-byte data.
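
As an illustration of what a machine-independent representation involves, the sketch below stores a 32-bit integer in big-endian byte order, the order used by XDR, regardless of the byte order of the host. The helper names `put_be32` and `get_be32` are hypothetical; a netCDF library performs such conversions internally, and this is not its actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v into out[0..3] in big-endian order, independent of host order. */
void put_be32(unsigned char out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a big-endian 32-bit value back from in[0..3]. */
uint32_t get_be32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because the functions use shifts rather than reinterpreting memory, a file written on one architecture reads back identically on another.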

3.2 Serial NetCDF API 

The original netCDF API was designed for serial codes 
to perform netCDF operations through a single process. In 
the serial netCDF library, a typical sequence of operations 
to write a new netCDF dataset is to create the dataset; define 
the dimensions, variables, and attributes; write variable
data; and close the dataset. Reading an existing netCDF
dataset involves first opening the dataset; inquiring about
dimensions, variables, and attributes; reading variable
data; and closing the dataset.

These netCDF operations can be divided into the following
five categories. Refer to [8] for details of each function
in the netCDF library.

(1) Dataset Functions: create/open/close a dataset,
set the dataset to define/data mode, and synchronize
dataset changes to disk

(2) Define Mode Functions: define dataset dimensions
and variables

(3) Attribute Functions: manage adding, changing,
and reading attributes of datasets

(4) Inquiry Functions: return dataset metadata:
dim(id, name, len), var(name, ndims, shape, id)

(5) Data Access Functions: provide the ability to
read/write variable data in one of the five access
methods: single value, whole array, subarray,
subsampled array (strided subarray), and mapped
strided subarray

The I/O implementation of the serial netCDF API is built
on the native I/O system calls and has its own buffering
mechanism in user space. Its design and optimization
techniques are suitable for serial access but are
inefficient, or even unusable, for parallel access, nor do
they allow the further performance gains provided by modern
parallel I/O techniques.

3.3 Using NetCDF in Parallel Environments 

Today most scientific applications are programmed to 
run in parallel environments due to the increasing require- 
ments on data amount and computational resources. It is 
highly desirable to develop a set of parallel APIs for access- 
ing netCDF files that employs appropriate parallel I/O tech- 
niques. In the meantime, programming convenience is also 
important, since scientific users may desire to spend mini- 
mum effort on dealing with I/O operations. Before present- 
ing our design on parallel netCDF, we would like to discuss 
current approaches for using netCDF in parallel programs 
in a message-passing environment. 

The first and most straightforward approach is described
in the scenario of Figure 2(a), in which one process is in
charge of collecting/distributing data and performing I/O to
a single netCDF file using the serial netCDF API. The I/O
requests from other processes are carried out by shipping all
the data through this single process. The drawback of this
approach is that collecting all I/O data on a single process
can easily cause an I/O performance bottleneck and may
overwhelm its memory capacity.

Figure 2. Using netCDF in parallel programs: (a) use serial netCDF API to access single files through
a single process; (b) use serial netCDF API to access multiple files concurrently and independently;
(c) use new parallel netCDF API to access single files cooperatively or collectively.

To avoid unnecessary data shipping, an alternative
approach is to have all processes perform their I/O
independently using the serial netCDF API, as shown in
Figure 2(b). In this case, all netCDF operations can proceed
concurrently, but over multiple files, one for each process.
However, using multiple files to store a single netCDF
dataset complicates data management. This approach also
defeats the purpose of the netCDF design, which is easy
data integration and management.

A third approach introduces a new set of APIs with par- 
allel access semantics and optimized parallel I/O implemen- 
tation such that all processes perform I/O operations coop- 
eratively or collectively through the parallel netCDF library 
to access a single netCDF file. This approach, as shown in 
Figure 2(c), both frees the users from dealing with details of
parallel I/O and provides more opportunities for employing 
various parallel I/O optimizations in order to obtain higher 
performance. We discuss the details of this parallel netCDF 
design and implementation in the next section. 

4 Parallel NetCDF 

To facilitate convenient and high-performance parallel 
access to netCDF files, we define a new parallel interface 
and provide a prototype implementation. Since a large num- 
ber of existing users are running their applications over 
netCDF, our parallel netCDF design retains the original 
netCDF file format (version 3) and introduces minimum 
changes from the original interface. We distinguish the
parallel API from the original serial API by prefixing the C
function calls with "ncmpi_" and the Fortran function calls
with "nfmpi_".



4.1 Interface Design 

Our parallel netCDF API is built on top of MPI-IO. The 
parallel netCDF built on MPI-IO can benefit from several 
well-known optimizations already used in existing MPI-IO 
implementations, such as data sieving and two-phase I/O 
strategies [7, 12, 13, 15] in ROMIO. Figure 3 describes
the overall architecture for our design. 

In parallel netCDF, a file is opened, operated on, and closed
by the participating processes in a communication group. In 
order for these processes to operate on the same file space, 
especially the structural information contained in the file 
header, a number of changes have been made to the orig- 
inal serial netCDF API. 

For the function calls that create/open a netCDF file, an
MPI communicator is added in the argument list to define
the participating I/O processes within the file's open and
close scope. An MPI_Info object is also added to pass user
access hints to MPI-IO for further optimizations. By
describing the collection of processes with a communicator,
we provide the underlying implementation with information
that can be used to ensure file consistency. The MPI_Info
hint lets users deliver high-level access information, such
as file access patterns and file system specifics, to the
netCDF and MPI-IO libraries to direct optimization.

We keep the same syntax and semantics for the parallel 
netCDF define mode functions, attribute functions, and in- 
quiry functions as the original ones. These functions are 
also made collective to guarantee consistency of dataset 
structure among the participating processes in the same MPI 
communication group. For instance, the define mode
functions must be called by all processes with the same
argument values.

The major effort of this work is the parallelization of
the data access functions. We provide two sets of data
access APIs: a high-level API that mimics the serial netCDF
data access functions and serves as an easy path for original
netCDF users to migrate to the parallel interface, and a
flexible API that provides a more MPI-like style of access.
Specifically, the flexible API uses more MPI functionality in
order to provide better handling of internal data
representations and to more fully expose the capabilities of
MPI-IO to the application programmer. The major difference
between the two is the use of MPI derived datatypes. We
believe that MPI derived datatypes can describe access
patterns better than the subarray mapping methods used in
the original API.

Figure 3. Design of parallel netCDF on a parallel
I/O architecture. Parallel netCDF runs as
a library between user space and file system
space. It processes parallel netCDF requests
from user compute nodes and, after optimization,
passes the parallel I/O requests down to the
MPI-IO library; the I/O servers then receive
the MPI-IO requests and perform I/O
over the end storage on behalf of the user.

(a) WRITE:

1 ncmpi_create(mpi_comm, filename, 0, mpi_info, &file_id);

2 ncmpi_def_var(file_id, ...);
  ncmpi_enddef(file_id);

3 ncmpi_put_vara_all(file_id, var_id,
                     start[], count[],
                     buffer, bufcount,
                     mpi_datatype);

4 ncmpi_close(file_id);

(b) READ:

1 ncmpi_open(mpi_comm, filename, 0, mpi_info, &file_id);

2 ncmpi_inq(file_id, ...);

3 ncmpi_get_vars_all(file_id, var_id,
                     start[], count[], stride[],
                     buffer, bufcount,
                     mpi_datatype);

4 ncmpi_close(file_id);

Figure 4. Example of using parallel netCDF.
Typically there are 4 main steps: 1. collectively
create/open the dataset; 2. collectively
define the dataset by adding dimensions,
variables, and attributes in WRITE, or
inquire about the dataset to get metadata
associated with the dataset in READ; 3. access
the data arrays (collective or non-collective);
4. collectively close the dataset.

The most important change from the original netCDF
interface with respect to data access functions is the split
of data mode into two distinct modes, collective and
non-collective; collective function names end with "_all".
Similar to MPI-IO, the collective functions are synchronous
across the processes in the communicator associated with
the opened netCDF file, while the non-collective functions
are not. Using collective operations gives the underlying
parallel netCDF implementation an opportunity to further
optimize access to the netCDF file. These optimizations are
performed without further intervention by the application
programmer and have been shown to provide dramatic
performance improvement in multidimensional dataset
access [15]. Figure 4 gives example code that uses our
parallel netCDF API to write and read a dataset using
collective I/O.

4.2 Parallel Implementation 

The parallel API implementation is discussed in two 
parts: header I/O and parallel data I/O. We first describe 
our implementation strategies for dataset functions, define
mode functions, attribute functions, and inquiry functions 
that access the netCDF file header. 

4.2.1 Access to File Header 

Internally, the header is read/written only by a single 
process, although a copy is cached in local memory on 
each process. The define mode functions, attribute func- 
tions, and inquiry functions all work on the local copy of 
the file header. Since they are all in-memory operations
that involve no file I/O, they bear few changes from the
serial netCDF API. They are made collective, but this
feature does not necessarily imply inter-process
synchronization. In some cases, however, when the header
definition is changed, synchronization is needed to verify
that the values passed in by all processes match. In all
possible cases we allow inter-process communications.

The dataset functions, unlike the other functions cited, 
need complete reimplementation because they are in
charge of collectively opening/creating datasets, perform- 
ing header I/O and file synchronization for all processes, 
and managing inter-process communication. We build these 
functions over MPI-IO so that they have better portabil- 
ity and provide more optimization opportunities. The ba- 
sic idea is to let the ROOT process fetch the file header, 
broadcast it to all processes when opening a file, and write 
the file header at the end of definition if any modifica- 
tion occurs in the header part. Since all define mode and 
attribute functions are collective and require all processes 
in the communicator to provide the same arguments when 
adding/removing/changing definitions, the local copies of 
the file header shall be the same across all processes once 
the file is collectively opened and until it is closed. 

4.2.2 Parallel I/O for Array Data 

Since the majority of time spent accessing a netCDF file 
is in data access, the data I/O must be efficient. By imple- 
menting the data access functions above MPI-IO, we enable 
a number of advantages and optimizations. 

For each of the five data access methods in the flexible 
data access functions, we represent the data access pattern 
as an MPI file view (a set of data visible and accessible 
from an open file Q), which is constructed from the vari- 
able metadata (shape, size, offset, etc.) in the netCDF file 
header and start[], count[], stride[], imap[], mpi_datatype
arguments provided by users. For parallel access, particu- 
larly for collective access, each process has a different file 
view; and all processes in combination can make a single 
MPI-IO request to transfer large contiguous data as a whole, 
thereby preserving useful semantic information that would 
otherwise be lost if the transfer were expressed as per pro- 
cess noncontiguous requests. 
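
To make the notion of a file view more tangible, the sketch below lists the contiguous byte runs that a subarray access (start[], count[]) touches in a row-major array stored at a known file offset. This is an illustration in plain C under stated assumptions, not the library's implementation: `subarray_runs` and `MAX_DIMS` are hypothetical names, strides and index maps are left out, and the caller is assumed to pass a subarray that lies inside the array. An MPI-based implementation might instead describe the same pattern with a derived datatype and hand it to `MPI_File_set_view`.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIMS 8

/* Enumerate the contiguous byte runs covered by the subarray
 * [start, start+count) of a row-major array stored at var_begin.
 * Writes up to max_runs starting offsets into run_off and returns the
 * total number of runs; every run is *run_len bytes long.
 * Bounds (start + count <= shape) are the caller's responsibility. */
size_t subarray_runs(int ndims, const size_t *shape,
                     const size_t *start, const size_t *count,
                     size_t elem_size, size_t var_begin,
                     size_t *run_off, size_t max_runs, size_t *run_len)
{
    assert(ndims >= 1 && ndims <= MAX_DIMS);

    size_t stride[MAX_DIMS];           /* element stride of each dimension */
    stride[ndims - 1] = 1;
    for (int d = ndims - 2; d >= 0; d--)
        stride[d] = stride[d + 1] * shape[d + 1];

    size_t nruns = 1;                  /* one run per outer-index tuple */
    for (int d = 0; d < ndims - 1; d++)
        nruns *= count[d];
    *run_len = count[ndims - 1] * elem_size;

    size_t idx[MAX_DIMS] = {0};        /* odometer over the outer dims */
    for (size_t r = 0; r < nruns; r++) {
        size_t elem = start[ndims - 1];
        for (int d = 0; d < ndims - 1; d++)
            elem += (start[d] + idx[d]) * stride[d];
        if (r < max_runs)
            run_off[r] = var_begin + elem * elem_size;
        for (int d = ndims - 2; d >= 0; d--) {   /* advance the odometer */
            if (++idx[d] < count[d])
                break;
            idx[d] = 0;
        }
    }
    return nruns;
}
```

For a 4 x 6 array of 4-byte elements stored at offset 100, the subarray with start = {1, 2} and count = {2, 3} yields two 12-byte runs at offsets 132 and 156. Each process in a collective call would have its own such list, which is the information a collective I/O layer can merge into larger transfers.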

The high-level data access functions are implemented 
in terms of the flexible data access functions, so that ex- 
isting users migrating from serial netCDF can also bene- 
fit from the MPI-IO optimizations. However, the flexible 
data access functions are closer to MPI-IO and hence incur 
less overhead. They accept a user-specified MPI derived 
datatype and pass it directly to MPI-IO for optimal handling 
of in-memory data access patterns. 

In some cases (for instance, in record variable access) 
the data is stored interleaved by record and the contiguity 
information is lost, so the existing MPI-IO collective I/O 
optimization may not help. In that case, we need more opti- 
mization information from users, such as the number, order, 
and record indices of the record variables they will access 
consecutively. With such information we can collect multi- 
ple I/O requests over a number of record variables and opti- 
mize the file I/O over a large pool of data transfers, thereby 
producing more contiguous and larger transfers. This kind
of information is passed in as an MPI_Info hint when a user
opens or creates a netCDF dataset. We implement our own
user hints in parallel netCDF for all such specific
optimization points, while a number of standard hints are
passed down for MPI-IO to control optimal parallel I/O
behaviors. Thus experienced users have the opportunity to
tune their applications for further performance gains.
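
The text does not spell out how pooled requests are combined, so the following is only a generic sketch of one way to turn many small requests into fewer, larger ones: sort the byte ranges by file offset and merge those that touch or overlap. The type `io_req` and the function `coalesce` are hypothetical names and are not part of parallel netCDF or MPI-IO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t off, len; } io_req;   /* hypothetical request record */

/* qsort comparator: order requests by ascending file offset. */
static int by_offset(const void *a, const void *b)
{
    const io_req *x = a, *y = b;
    return (x->off > y->off) - (x->off < y->off);
}

/* Sort requests by file offset and merge those that touch or overlap.
 * Returns the number of merged requests left at the front of reqs. */
size_t coalesce(io_req *reqs, size_t n)
{
    if (n == 0)
        return 0;
    assert(reqs != NULL);
    qsort(reqs, n, sizeof *reqs, by_offset);

    size_t out = 0;                               /* last merged request */
    for (size_t i = 1; i < n; i++) {
        size_t end = reqs[out].off + reqs[out].len;
        if (reqs[i].off <= end) {                 /* adjacent or overlapping */
            size_t iend = reqs[i].off + reqs[i].len;
            if (iend > end)
                reqs[out].len = iend - reqs[out].off;
        } else {
            reqs[++out] = reqs[i];                /* gap: start a new run */
        }
    }
    return out + 1;
}
```

Given requests for bytes [40, 48), [0, 16), [16, 40), and [100, 104), this sketch produces two transfers: one of 48 bytes at offset 0 and one of 4 bytes at offset 100.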

4.3 Advantages and Disadvantages 

There are a number of advantages within the design 
and implementation of our parallel netCDF, as compared 
to other related work, like HDF5. 

First of all, the parallel netCDF design and implementation
are optimized for the netCDF file format so that the data
I/O performance is as good as that of the underlying
MPI-IO. The netCDF file format uses a linear data layout,
in which the data arrays are either stored in contiguous
space and in a predefined order or interleaved in a regular
pattern. This regular and highly predictable data layout
enables the parallel netCDF data I/O implementation to
simply pass the data buffer, metadata (file view,
mpi_datatype, etc.), and other optimization information to
MPI-IO, and all parallel I/O operations are carried out in
the same manner as when MPI-IO alone is used. Thus, there is
very little overhead, and the parallel netCDF performance
should be nearly the same as MPI-IO if only raw data I/O
performance is compared. On the other hand, parallel HDF5
uses a tree-like file structure that is similar to the UNIX
file system, and the data is dispersed across a super
block, header blocks, data blocks, extended header blocks,
and extended data blocks. This irregular layout pattern may
make it difficult to pass user access patterns directly to
MPI-IO, especially for variable-size arrays. Instead,
parallel HDF5 uses dataspaces and hyperslabs to define the
data organization, maps and transfers data between the
memory space and the file space, and packs/unpacks buffers
recursively, whereas these tasks could otherwise be handled
directly by MPI-IO in a more efficient and optimized way.

Secondly, the parallel netCDF implementation manages
to keep the overhead involved in header I/O as low as
possible. In the netCDF file, there is only one header,
which contains all necessary information for direct access
to each data array, and each array is associated with a
predefined numerical ID that can be efficiently looked up
when the array needs to be accessed. So, by maintaining a
local copy of the header on each process, our implementation
saves a lot of inter-process synchronization and avoids
repeated access to the file header each time the header
information is needed to access a single array. All header
information can be accessed directly in local memory, and
inter-process synchronization is needed only during the
definition of the dataset. Once the definition of the
dataset is created, each array can be identified by its
permanent ID and accessed at any time by any process,
without any collective open/close operation. In HDF5,
however, the header metadata is dispersed in separate
header blocks for each object, and in order to operate on
an object, the library has to iterate through the entire
namespace to get the header information of that object and
then open, access, and close it. This kind of access method
may be inefficient for parallel access, since parallel HDF5
designs the open/close of each object as collective
operations, which force all participating processes to
communicate when accessing one single object, not to
mention the cost of file access to locate and fetch the
header information of that object.

Lastly, the programming interface of the parallel netCDF 
is concise and designed for easy usage, and the file format is 
fully compatible with serial netCDF. Porting existing serial
netCDF applications to parallel netCDF should be
straightforward because the parallel API contains nearly all
functions of the serial API with parallel semantics but with
minimal changes to function names and argument lists.

However, there are also limitations in parallel netCDF.
Unlike HDF5, netCDF does not support hierarchical,
group-based organization of data objects. Moreover, since it
lays out the data in a linear order, adding a fixed-size
array or extending the file header may be very costly once
the file is created and has existing data stored, even
though moving the existing data to the extended area is
performed in parallel. Also, parallel netCDF does not
provide functionality to combine two or more files in
memory through software mounting, as HDF5 does. Nor does
netCDF support data compression within its file format.
Fortunately, these features can all be achieved by external
software, at the cost of some manageability of the files.



5 Performance Evaluation 



To evaluate the performance and scalability of our parallel
netCDF against serial netCDF, we ran a set of experiments.
We also compared the performance of parallel netCDF with
that of parallel HDF5, using the FLASH I/O benchmark.

The experiments were run on an IBM SP-2 machine. 
This system is a teraflop-scale clustered SMP with 144 com- 
pute nodes. Each compute node has 4 GB of memory shared 
among its eight 375 MHz Power3 processors. All the com- 
pute nodes are interconnected by switches and also con- 
nected via switches to the multiple I/O nodes running the 
GPFS parallel file system. There are 12 I/O nodes, each 
with dual 222 MHz processors. The aggregate disk space is
5 TB and the peak I/O bandwidth is 1.5 GB/s. 
Figure 5. Various 3-D array partitions on 8 processors.



5.1 Scalability Analysis 

We wrote a test code (in C language) to evaluate the per- 
formance of the current implementation of parallel netCDF. 
This test code was originally developed in Fortran by 
Woo-sun Yang and Chris Ding at Lawrence Berkeley Na- 
tional Laboratory (LBL). Basically it reads/writes a three- 
dimensional array field tt(Z,Y,X) from/into a single netCDF 
file, where Z=level is the most significant dimension and 
X=longitude is the least significant dimension. The test 
code partitions the three-dimensional array along Z, Y, X,
ZY, ZX, YX, and ZYX axes, respectively, as illustrated in
Figure 5. All data I/O operations in these tests used
collective I/O. For comparison purposes, we prepared the
same test using the original serial netCDF API and ran it
in serial mode, in which a single processor reads/writes
the whole array.
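
The test code itself is not reproduced here, so the following is only a sketch of how the start[] and count[] arguments for a Z partition might be computed for each process. The helpers `block_range` and `z_partition` are hypothetical, and the even block decomposition they use is an assumption rather than the benchmark's documented behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Split n indices into nprocs nearly equal contiguous blocks; the first
 * (n % nprocs) ranks receive one extra index. */
void block_range(size_t n, int nprocs, int rank, size_t *start, size_t *count)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t base = n / (size_t)nprocs;
    size_t extra = n % (size_t)nprocs;
    size_t r = (size_t)rank;
    *count = base + (r < extra ? 1 : 0);
    *start = r * base + (r < extra ? r : extra);
}

/* start[]/count[] for a Z partition of tt(Z,Y,X): each process owns a slab
 * of whole Y-X planes along the most significant dimension. */
void z_partition(const size_t dims[3], int nprocs, int rank,
                 size_t start[3], size_t count[3])
{
    block_range(dims[0], nprocs, rank, &start[0], &count[0]);
    start[1] = 0; count[1] = dims[1];
    start[2] = 0; count[2] = dims[2];
}
```

The resulting arrays could be passed to a collective call such as the one in step 3 of Figure 4. Partitions along Y or X would apply `block_range` to dims[1] or dims[2] instead, and combined partitions such as ZY would apply it to two dimensions over a two-dimensional process grid.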

Figure [6] shows the performance results for reading and 
writing 64 MB and 1 GB netCDF datasets. Generally, the 
parallel netCDF performance scales with the number of pro- 
cesses. Because of collective I/O optimization, the perfor- 
mance difference made by various access patterns is small, 
although partitioning in the Z dimension generally performs 
better than in the X dimension because of the different 
access contiguity. The overhead involved is inter-process 
communication, which is negligible comparing to the disk 
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Figure 6. Serial and parallel netCDF performance for 64 MB and 1 GB datasets. The first column of 
each chart shows the I/O performance of reading/writing the whole array through a single processor 
using serial netCDF; the rest of the columns show the results using parallel netCDF. 



I/O when the file size is large. The I/O bandwidth does 
not scale in direct proportion because the number of I/O 
nodes (and disks) is fixed, so the dominating disk access 
time at the I/O nodes is almost fixed. As expected, parallel 
netCDF outperforms the original serial netCDF as the number 
of processes increases. The difference between the serial 
netCDF performance and the parallel netCDF performance with 
a single processor is due to their different I/O 
implementations and different I/O caching/buffering 
strategies. In the serial netCDF case, if, as in 
Figure 3(a), multiple processors were used and the root 
processor needed to collect partitioned data and then 
perform the serial netCDF I/O, the performance would be 
much worse and would decrease with the number of processors 
because of the additional communication cost and the 
division of a large I/O request into a series of small 
requests. 
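
The remark on access contiguity (Z partitions versus X partitions) can be made concrete by counting how many contiguous file segments one process's block covers. The sketch below assumes a row-major file layout with X varying fastest, as in tt(Z,Y,X); the function name `contiguous_runs` is hypothetical and the code is not taken from the test program.

```c
#include <assert.h>
#include <stddef.h>

/* Count the contiguous file segments touched by a subarray of extent
 * count = (cz, cy, cx) inside a row-major array of shape
 * dims = (Z, Y, X). The length of each segment, in elements, is
 * returned through *run_len. */
size_t contiguous_runs(const size_t dims[3], const size_t count[3],
                       size_t *run_len)
{
    for (int d = 0; d < 3; d++)
        assert(count[d] > 0 && count[d] <= dims[d]);

    if (count[2] < dims[2]) {
        /* Rows are only partly covered: each (z, y) row is its own run. */
        *run_len = count[2];
        return count[0] * count[1];
    }
    if (count[1] < dims[1]) {
        /* Full rows but partial planes: one run per Z level. */
        *run_len = count[1] * dims[2];
        return count[0];
    }
    /* Full planes: the whole block is one contiguous run. */
    *run_len = count[0] * dims[1] * dims[2];
    return 1;
}
```

For an illustrative 8 x 16 x 32 array split over 8 processes, a Z-partitioned block is a single run of 512 elements, whereas an X-partitioned block consists of 128 runs of 4 elements each. Collective I/O can merge such small requests across processes, which is consistent with the small performance differences reported above.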

5.2 FLASH I/O Performance 

The FLASH I/O benchmark simulates the I/O pattern of an 
important scientific application called FLASH. It recreates 
the primary data structures in the FLASH code and produces 
a checkpoint file, a plotfile with centered data, and a 
plotfile with corner data, using parallel HDF5. These three 
output files contain a series of multidimensional arrays, 
and the access pattern is simple (Block, *, ...), which is 
similar to the Z partition in Figure 5. In each of the 
files, the benchmark writes the related arrays in a fixed 
order from contiguous user buffers. The I/O routines in the 
benchmark are identical to the routines used by FLASH, so 
any performance improvements made to the benchmark program 
will be shared by FLASH. In our experiments, in order to 
focus on the data I/O performance, we modified this 
benchmark by removing the code that writes attributes and 
porting it to parallel netCDF, and we observed the effect 
of our new parallel I/O approach. 
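
As a rough illustration of the (Block, *, ...) access pattern, the sketch below computes the slab one process would write for a single variable. It assumes, hypothetically, that each variable is stored as a four-dimensional array [total_blocks][nzb][nyb][nxb] and that every process owns the same number of consecutive blocks; the names `BlockSlab`, `flash_block_slab`, and `slab_elements` are invented for this example and do not come from the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one process's slab of a single variable
 * stored as [total_blocks][nzb][nyb][nxb]. */
typedef struct {
    size_t start[4];
    size_t count[4];
} BlockSlab;

/* Each process owns blocks_per_proc consecutive blocks and all cells
 * of every block it owns (an assumption made for this sketch). */
BlockSlab flash_block_slab(int rank, size_t blocks_per_proc,
                           size_t nzb, size_t nyb, size_t nxb)
{
    BlockSlab s;

    assert(rank >= 0);
    s.start[0] = (size_t)rank * blocks_per_proc;
    s.count[0] = blocks_per_proc;
    /* The remaining dimensions are taken whole: the "*" in (Block, *, ...). */
    s.start[1] = 0; s.count[1] = nzb;
    s.start[2] = 0; s.count[2] = nyb;
    s.start[3] = 0; s.count[3] = nxb;
    return s;
}

/* Number of array elements covered by the slab. */
size_t slab_elements(const BlockSlab *s)
{
    return s->count[0] * s->count[1] * s->count[2] * s->count[3];
}
```

Because only the first (slowest-varying) dimension is partitioned, each process's slab is one contiguous region of the file, just as in the Z partition of the three-dimensional test.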

Figure 7 shows the performance results of the FLASH I/O 
benchmark using parallel netCDF and parallel HDF5. We 
tested both a small and a large data size. The parameters 
used in these two experiments are: (a) nxb = nyb = nzb = 8, 
nguard = 4, number of blocks = 80, and nvar = 24; (b) nxb = 
nyb = nzb = 16, nguard = 8, number of blocks = 80, and nvar 
= 24. Although both I/O libraries are built above MPI-IO, 
parallel netCDF has much less overhead and outperforms 
parallel HDF5, almost doubling the overall I/O rate. The 
extra overhead involved in the current release of HDF5 
(version 5-1.4.3) includes inter-process synchronizations 
and file header access performed internally in the parallel 
open/close of every dataset (analogous to a netCDF 
variable) and recursive handling of the hyperslab used for 
parallel access, which makes the packing of the hyperslabs 
into contiguous buffers take a relatively long time. 

Figure 7. Performance of the FLASH I/O benchmark using parallel HDF5 and parallel netCDF: (a) small 
data size; (b) large data size. The two experiments use different parameters, so the file sizes are 
different. The file sizes also vary with the number of processors. The I/O amount is 3 MB × number 
of processors in (a) and 24 MB × number of processors in (b). 

6 Conclusion and Future Work 

In this work, we extend the serial netCDF interface to 
facilitate parallel access, and we provide an 
implementation for a subset of this new parallel netCDF 
interface. By building on top of MPI-IO, we gain a number 
of interface advantages and performance optimizations that 
users of this parallel netCDF package can benefit from, as 
shown by our test results. So far, a number of users from 
LBL, ORNL, and the University of Chicago are using our 
parallel netCDF library. 

Future work involves developing a production-quality 
parallel netCDF API (for C, C++, Fortran, and other pro- 
gramming languages) and making it freely available to the 
high-performance computing community. Moreover, we 
need to develop a mechanism for matching the file organi- 
zation to access patterns, and we need to develop cross-file 
optimizations for addressing common data access patterns. 